Duration prediction using multi-level model for GPR-based speech synthesis
Abstract
This paper introduces frame-based Gaussian process regression (GPR) into phone/syllable duration modeling for Thai speech synthesis. The GPR model is designed to predict frame-level acoustic features from the corresponding frame information, which includes the relative position in each unit of the utterance structure and linguistic information such as tone type and part of speech. Although GPR-based prediction can be applied to a phone duration model, a phone duration model alone is not always sufficient to generate natural-sounding speech. Specifically, in some languages, including Thai, syllable durations affect the perception of sentence structure. In this paper, we propose a duration prediction technique using a multi-level model that covers both the syllable and phone levels. The technique first predicts syllable durations and then uses them as additional contexts in the phone-level model to generate the phone durations used for synthesis. Objective and subjective evaluation results show that GPR-based modeling with the multi-level duration model outperforms conventional HMM-based speech synthesis.
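The two-stage idea in the abstract (predict syllable durations first, then feed them to the phone-level model as extra context) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes a plain unit-level GPR with a squared-exponential kernel rather than the frame-based GPR the paper describes, numeric context vectors in place of real linguistic features, and observed syllable durations as the extra context at training time. All function and parameter names are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between the rows of a and the rows of b.
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gpr_fit_predict(x_train, y_train, x_test, noise=1e-2):
    # Posterior mean of a GP whose prior mean is the training-target mean.
    mean = y_train.mean()
    k = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    alpha = np.linalg.solve(k, y_train - mean)
    return rbf_kernel(x_test, x_train) @ alpha + mean


def predict_phone_durations(syl_x_train, syl_dur_train,
                            phone_x_train, phone_syl_idx_train, phone_dur_train,
                            syl_x_test, phone_x_test, phone_syl_idx_test):
    # Stage 1: syllable-level model predicts syllable durations.
    syl_pred = gpr_fit_predict(syl_x_train, syl_dur_train, syl_x_test)

    # Stage 2: append the parent syllable's duration to each phone's context.
    # Assumption: observed durations at training time, predicted ones at test time.
    train_ctx = np.column_stack(
        [phone_x_train, syl_dur_train[phone_syl_idx_train]])
    test_ctx = np.column_stack(
        [phone_x_test, syl_pred[phone_syl_idx_test]])
    phone_pred = gpr_fit_predict(train_ctx, phone_dur_train, test_ctx)
    return syl_pred, phone_pred
```

Here `phone_syl_idx_*` maps each phone to the row of its parent syllable, which is how the syllable-level prediction reaches the phone-level model in this sketch.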
Similar resources
A comparison of speech synthesis systems based on GPR, HMM, and DNN with a small amount of training data
In this paper, we evaluate a framework of statistical parametric speech synthesis based on Gaussian process regression (GPR) and compare it with those based on hidden Markov model (HMM) and deep neural network (DNN). Recently, for the purpose of improving the performance of HMM-based speech synthesis, novel frameworks using deep architectures have been proposed and have shown their effectivenes...
Combining extreme learning machine and decision tree for duration prediction in HMM based speech synthesis
Hidden Markov Model (HMM) based speech synthesis using Decision Tree (DT) for duration prediction is known to produce over-averaged rhythm. To alleviate this problem, this paper proposes a two level duration prediction method together with outlier removal. This method takes advantages of accurate regression capability by Extreme Learning Machine (ELM) for phone level duration prediction, and th...
Analysis of Duration Prediction Accuracy in HMM-Based Speech Synthesis
Appropriate phoneme durations are essential for high quality speech synthesis. In hidden Markov model-based text-to-speech (HMM-TTS), durations are typically modeled statistically using state duration probability distributions and duration prediction for unseen contexts. Use of rich context features enables synthesis without high-level linguistic knowledge. In this paper we analyze the accuracy ...
Auto-Switch Gaussian Process Regression-based Probabilistic Soft Sensors for Industrial Multi-Grade Processes with Transitions
Prediction uncertainty has rarely been integrated into traditional soft sensors in industrial processes. In this work, a novel auto-switch probabilistic soft sensor modeling method is proposed for online quality prediction of a whole industrial multi-grade process with several steady-state grades and transitional modes. Several single Gaussian process regression (GPR) models are first construct...
Explicit duration modelling in HMM-based speech synthesis using a hybrid hidden Markov model-multilayer perceptron
In HMM-based speech synthesis, it is important to correctly model duration because it has a significant effect on the perceptual quality of speech, such as rhythm. For this reason, hidden semi-Markov model (HSMM) is commonly used to explicitly model duration instead of using the implicit state duration model of HMM through its transition probabilities. The cost of using HSMM to improve duration...
Publication date: 2015